Intro

My name is Lior and I am a Data Engineer/Backend Engineer and Computer Science student. As a data engineer, I am familiar with data architectures, pipelines, transformations, cleaning, etc.

This course is an opportunity for me to dive into the data science world and become more familiar with other parts of the "data cycle" - feature engineering, predictive modelling and data visualizations.

Moreover, Natural Language Processing (NLP) is a field I have wanted to learn since I attended a lecture about it last year. When I found the Kaggle challenge, I thought it was a great opportunity. The challenge's goal is to build a model that can look at the labelled sentiment for a given tweet and figure out what word or phrase best supports it.

Sentiment Analysis & Keyword Extraction

Sentiment Analysis - A tool to automatically monitor emotions in conversations - on social media platforms, in product reviews, in chatbots.

Sentiment analysis is done with NLP. NLP is a field that focuses on making natural human language usable by computer programs.

Some examples of NLP implementations are search engines like Google, speech assistants like Siri, social feeds like the Facebook news feed, spam filters, and more. All of these algorithms understand your interests - each one for its own goal.

From Wikipedia:

Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.

Keyword extraction is a text analysis technique that extracts the most important words and phrases from a text. It helps summarize long content into a few words and phrases, to recognize the main topics discussed in a text.

Keyword extraction is done with machine learning techniques in NLP. We can use it to find keywords in all manner of text: academic articles, business reports, social media comments, forum messages and reviews, news reports, and more.

Imagine you want to analyze thousands of online reviews about your product and ask questions such as: How many negative reviews relate to delivery time? How many reviews talk about product quality? Keyword extraction helps you sift through the whole set of data and obtain the words that best describe each review in just seconds, tagging the reviews. With that, you can easily and automatically see what your customers mention most often, make data-driven decisions, and save the time of manual processing.

When extracting the main phrases, we condense a long text into a few words, which makes it easy to understand what the text is about. With that, we can generate tags, indexes, and summaries.

The challenge

https://www.kaggle.com/c/tweet-sentiment-extraction/overview

The challenge and this project handle social media data. Our goal is to look at the labeled sentiment for a given tweet and figure out what word or phrase best supports it.

Intro to the "Twitter Culture"

Tweets are not written properly. They are very noisy, and significant work has to be done to clean and analyze them and make them meaningful to use. An example from the dataset is the best way to understand:

so i got to my exam centre n they said we can`t let u becuz of your sleeveless top! U cud BELIEVE that!? i had to go home

Analyzing this tweet from a human point of view -

  1. Translating to English - n = and, u/U = you, becuz = because, cud = could
  2. centre -> center. This one seems to be a typo.
  3. We have !? here, and BELIEVE is capitalised - which amplifies the anger of the writer.
  4. U cud BELIEVE that!? is not a proper way to ask a question in English. It should be written as: Could you believe that?

We can understand this tweet is negative, and I would extract we can`t let u or U cud BELIEVE to best support it.

OK, so we can understand tweets are a mess. They are not the same as documents, or customer reviews - But Y!?!

Partly, it is because tweets need to be short. How short? Tweets have a limit of 280 characters.

The effect of this rule is that a long text will be shared as a chain of comments (a "thread"), but mainly that most tweets are short.

Other than that, we have hashtagging (#), @ing (@), the desire to stand out with capital LETTERS, emojis 🥑👑, and other marks ?!@#$%.

After understanding this, I am starting with imports and loading the challenge data:

First, let's look at the data -

The columns of the dataframe:

EDA

Data Distribution

I want to see the tag distribution: how many positive/neutral/negative tweets we have in the dataset.
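As a sketch (the sample dataframe here is illustrative, not the notebook's actual loading code), counting tweets per sentiment with pandas looks like:

```python
import pandas as pd

# Assumes the challenge CSV has been loaded into `df` with a "sentiment"
# column, as in the Kaggle dataset (textID, text, selected_text, sentiment).
df = pd.DataFrame({
    "text": ["have a great day", "so sad today", "on my way home"],
    "sentiment": ["positive", "negative", "neutral"],
})

# Count how many tweets fall under each sentiment tag.
counts = df["sentiment"].value_counts()
print(counts.to_dict())
```

Plotting `counts` as a bar chart gives the distribution figure discussed below.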

It seems we have 12K neutral tweets, 10K positive tweets, and 9K negative tweets. This is a good overall balance, and this is a basic check - if we had tons of negative data but only a minority of positive data, keyword extraction for negative tweets would score better than for positive tweets.

Text cleaning

Tweets are very "dirty". They mix lower- and upper-case letters, many signs, and URLs. Let's look at the dirty tweets to learn what we are dealing with.

We have 1394 rows containing "http://". This makes sense since people share websites, articles, and the latest news.

We have 1457 rows with more than 15 characters that are not lowercase letters, spaces, or punctuation (,.). Some of those tweets are still written in understandable language -
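A hedged sketch of how such counts could be computed (the exact "noisy character" rule here is my assumption of the check described above):

```python
import re
import pandas as pd

# A stand-in for the challenge dataframe.
df = pd.DataFrame({"text": [
    "check this out http://example.com",
    "SEe waT I Mean bOuT FoLL0w fRiiDaYs... ItS cALLed LoSe f0LloWeRs FridAy... smH",
    "just a normal tweet, nothing special.",
]})

# Rows that contain a URL.
has_url = df["text"].str.contains("http://", regex=False)

def noisy_char_count(text):
    # Count characters that are NOT lowercase letters, spaces, commas or periods.
    return len(re.findall(r"[^a-z ,.]", text))

# Rows with more than 15 such "noisy" characters.
very_noisy = df["text"].apply(noisy_char_count) > 15

print(has_url.sum(), very_noisy.sum())
```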

SEe waT I Mean bOuT FoLL0w fRiiDaYs... ItS cALLed LoSe f0LloWeRs FridAy... smH

It took me more than 30 seconds to understand what is written here, even though there are only 13 words. Those tweets emphasise the "Twitter Culture" we talked about in the intro. I can't wait to see how our models will deal with those tweets, but for now let's continue with the EDA.

For our EDA, we want to analyze the text without URLs and signs. Our cleaning function will lowercase all text and remove non-letters and URLs, since the main goal of the EDA is to understand the common words used in tweets by sentiment. The cleaning helps us "normalize" the text and get a better understanding of keywords by sentiment.
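A minimal sketch of such a cleaning function (the exact regex rules are my assumption of the approach described: lowercase, strip URLs, keep letters only):

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z]+", " ", text)  # keep letters only
    return text.strip()

print(clean_text("SEe waT I Mean!! http://t.co/abc #FoLLoW"))
# → see wat i mean follow
```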

However, in our model it's important not to lose that data. For the modeling, the cleaning will be minimal, since the selected_text column has "uncleaned" text - it has capital letters, URLs, and non-alphanumeric characters such as !@$#.

Tokenization

Word tokenization breaks text down into individual words. There is also sentence tokenization, that breaks text down into individual sentences.

Tokenization is one of the first steps in any NLP pipeline and has an important effect. It is the process of breaking down chunks of text into smaller pieces that can be considered discrete elements. This turns an unstructured string into a numerical data structure suitable for machine learning.

Tokenization can split text into sentences, words, characters, or subwords. For the EDA, we want to split the text into individual words - word tokenization.

The following code creates a bucket of words for each sentiment, with a counter for each word in the bucket. Notice I remove English stop words.
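A toy sketch of that bucket-building step (the real code uses NLTK's English stop-word list; the tiny hard-coded stop-word set and sample tweets here are illustrative):

```python
from collections import Counter

# A small stand-in for NLTK's English stop-word list.
STOP_WORDS = {"i", "a", "the", "is", "to", "my", "so", "and"}

tweets = [
    ("so sad today", "negative"),
    ("i love my dog", "positive"),
    ("love the sunshine", "positive"),
]

# One word-count bucket per sentiment.
buckets = {"positive": Counter(), "negative": Counter(), "neutral": Counter()}
for text, sentiment in tweets:
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    buckets[sentiment].update(words)

print(buckets["positive"].most_common(1))
# → [('love', 2)]
```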

We have the words tokenized, with the number of occurrences of each word. We can build a word cloud and look at the most popular words.

The word clouds can help us understand the possible keywords for each sentiment in the tweets.

We can recognize positive words in the positive cloud such as good, love, happy, thanks.

For the negative cloud - miss, sad, sorry.

There are some anomalies as well:

Also, we can recognize "Twitter words" such as lol, u, haha.

Words Statistics

After looking at the word clouds, we can add numerical word features: average word length and number of words.
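A hedged sketch of computing those two features as new dataframe columns (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"text": ["have a great day", "so sad"]})

# Number of words per tweet.
df["word_count"] = df["text"].str.split().str.len()

# Average word length per tweet.
df["avg_word_len"] = df["text"].apply(
    lambda t: sum(len(w) for w in t.split()) / len(t.split())
)

print(df[["word_count", "avg_word_len"]])
```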

Tweet Length - It is shown here that neutral tweets are longer than positive/negative ones, and that positive tweets are a bit longer than negative ones. The distribution of the number of tweets by word count seems similar across all sentiments.

Text vs Selected Text

I want to explore how many selected words we have compared to the original word count.

When using keyword extraction we want the most important words, and not many words, since we want to summarize the text into a few words. It is important for us to understand whether the dataset tagging is good and the selected text contains a small number of words.

This is not as we expected. It seems that for positive & negative tweets most of the selected texts are 2-5 words. However, for the neutral tweets we have long selected texts, which is suspicious. Examining the similarity between those tweets and their selected text can help us understand. I use Jaccard similarity for this, as it is the scoring metric in the Kaggle challenge.
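The Jaccard similarity here is the word-level metric from the competition's evaluation; a minimal implementation following that definition (the zero-union edge case handling is my assumption):

```python
def jaccard(str1, str2):
    # Word-level Jaccard similarity, as in the Kaggle challenge metric:
    # |intersection| / |union| of the two word sets.
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    union = len(a) + len(b) - len(a & b)
    return len(a & b) / union if union else 0.0

print(jaccard("i had to go home", "i had to go home"))
# → 1.0 (identical texts)
```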

It's a bug, not a feature

A Jaccard score of 1 means the tweet text and the extracted text are identical.

It turns out most of the neutral tweets are identical to their extracted selected_text. Unfortunately, this looks like a data leak. I searched the Kaggle discussions, and other people found this out as well (and, of course, manipulated their models to follow this leak).

If I were part of the competition, the right approach would be to return the whole text as the selected text for neutral tweets. However, this is an academic project.

From the EDA we know we have around 12K rows of neutral tweets, out of 31K. The conclusion - for the prediction & score of the selected text, I won't use the neutral tweets.

Another thing we can see here is that positive & negative tweets each have around ~1000 tweets with a Jaccard score of 1 as well. I want to look at those too.

Comment - I started to explore this issue when I looked over Kaggle notebooks and saw that people got over 0.60 as a score by returning most of the words of the tweet as the prediction.

I thought that maybe short tweets have the same selected text as the tweet itself, but from the graph, this doesn't seem to be the case.

Our main finding is the neutral tweets, which do not seem useful for our goal of extracting the main words.

This made me think - do we want our selected text to be within a certain limit of word count? Even though this is not part of the Kaggle competition, maybe we want to enforce a limit? Should this limit be static, or should it change with the text size? When I explore the models' results, I'll look at the number of words in the selected text predictions as well.

Another conclusion - datasets are sometimes not tagged well. We need to check the tagging to understand how reliable it is, and many times we need to work with datasets that have tagging problems.

One-hot Features & PCA

Feature Extraction: In this method, we create new features that are not present in our original feature set. These features are not interpretable. The idea - every word in the tweet can become a feature. Each categorical parameter (word) becomes a separate column. The possible feature values are 1/0 - 1 if the word appears in the tweet and 0 otherwise. This means we can convert our words into numerical features.

After adding the word features, I'll use Principal Component Analysis (PCA) to visualize the data. PCA is the process of computing the principal components and using them to perform a change of basis on the data. After taking the top 1000 words used in the tweets, we can't visualize 1000 features; with dimensionality reduction, it becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D. I want to visualise our tweets and understand whether they are clustered by sentiment - will positive/negative/neutral tweets be gathered into groups? In other scenarios, dimensionality reduction can also help reduce computation power.
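A minimal sketch of the one-hot encoding and 2D PCA projection (the tiny corpus is illustrative; the real run uses the top 1000 words via max_features=1000):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "have a great day",
    "so sad and tired today",
    "on my way home",
    "what a great sunny day",
]

# binary=True gives the 1/0 "word appears in tweet" encoding described above.
vectorizer = CountVectorizer(binary=True, max_features=1000)
X = vectorizer.fit_transform(tweets)

# PCA needs a dense array; on the real sparse data TruncatedSVD is the usual
# alternative, which is why SVD was also tried.
X_2d = PCA(n_components=2).fit_transform(X.toarray())
print(X_2d.shape)
# → (4, 2)
```

Scatter-plotting `X_2d` colored by sentiment gives the cluster visualization.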

PCA Results -

This is a disappointment since I was hoping to see here pretty clusters of negative/positive/neutral tweets.

From reading, there are several possible reasons for PCA not working well.

PCA cannot be used on sparse data - and our data is very sparse! The most popular word appears 2401 times, while we have 372168 rows. That is why I also tried SVD, which is another dimensionality-reduction method, but the results were similar.

There are more scenarios where PCA doesn't work well - if the features are completely uncorrelated, the lower-dimensional representation obtained using PCA would not preserve much of the variance in the original data and hence would be useless. PCA also assumes that the principal components are orthogonal.

Or the tweets really are similar, and our models will need to work hard.

Modeling

Classification

We use classification algorithms to classify tweet sentiment, which is a classic sentiment analysis problem. The libraries we'll use are NLTK, the main NLP library, and sklearn, which implements many machine learning models.

NLTK Pre-trained Model

With NLTK's pre-trained model, we can predict the sentiment without any training. It uses VADER-Sentiment-Analysis, a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

This basic method gave us ~49% accuracy.

Even though this model is built for social media data, it was trained on a different kind of data than ours, and we can guess that if we train a model on our tweet data, we will probably get better results.

Sklearn - Naive Bayes, Logistic Regression, SGD Classifier

For better classification results, we want to train the models over our data. The models I've chosen to examine:

  1. Naive Bayes classifiers - Multinomial & Bernoulli - Naive Bayes classifiers are a family of simple "probabilistic classifiers". The name naive is used because the model assumes the features are independent of each other. Typical applications include filtering spam, classifying documents, and sentiment prediction - our case!

  2. Logistic regression - Logistic regression is named for the function used at the core of the method, the logistic function. Logistic regression, by default, is limited to two-class classification problems. Some extensions like one-vs-rest (which sklearn library implements) can allow logistic regression to be used for multi-class classification problems.

  3. Stochastic gradient descent (SGD) - An iterative method for optimizing an objective function with suitable smoothness properties. The gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (learning rate).

Feature extraction - For all the models, we use CountVectorizer to convert the tweets to a matrix; it implements both tokenization and occurrence counting.
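A toy sketch of this training setup: CountVectorizer feeds token counts into each classifier (the six-tweet corpus is illustrative; the real run trains on the tweet dataframe):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "i love this", "what a great day", "so happy today",
    "i hate this", "so sad today", "this is terrible",
]
labels = ["positive"] * 3 + ["negative"] * 3

for model in (MultinomialNB(), SGDClassifier(random_state=42)):
    # Vectorize then classify, as one pipeline.
    pipeline = make_pipeline(CountVectorizer(), model)
    pipeline.fit(texts, labels)
    print(type(model).__name__, pipeline.predict(["a great happy day"]))
```

The same pipeline shape covers BernoulliNB and LogisticRegression as well.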

We can see these models gave us better results than the pre-trained model, as expected. The Naive Bayes methods gave us 64-66% accuracy. Overall, the best classification result is with the SGD classifier, at 70% accuracy, with logistic regression a close second at 69%.

Another observation is that in all models, the neutral class has the lowest precision, meaning the models predict positive/negative tweets as neutral. This makes sense since among the pos/neg tweets we can recognize tweets that aren't "extremely negative" or "extremely positive". On the other hand, in recall, the negative tweets have the lowest score, which means our models tend to miss negative tweets and classify them as other sentiments.

I want to look at the predicted dataframe and see where our models have problems recognizing the right sentiment, and why. The "biggest" mistake our model can make is to predict negative tweets as positive, or positive as negative. How many "big" mistakes did our best model, the SGD classifier, make?

~2.5% of the tweets fall into this category. I think this is a good number.

I want to look at those mistakes and others, to understand why our models made wrong predictions for some tweets.

This text is tagged as positive, even though in my opinion it should be negative - it describes missing a mother who passed away. This seems to be an incorrect tag. The Naive Bayes models predicted this tweet as positive, while Logistic Regression and SGD predicted it as negative.

This tweet is tagged as positive. Our Naive Bayes models predicted this tweet as neutral, while Logistic Regression and SGD predicted it as negative. In contrast to the previous tweet, I think it is positive, as tagged.

However, analyzing this tweet, it says "the flight was good although I thought it wouldn't be", since the author is exhausted. That is why we can understand the neutral and negative predictions as well. Here is another, similar tweet:

Before looking at the tagging and the predictions: this tweet is similar to the previous one in that it has two sections - the first is positive, finally a beautiful sunny day in atlanta, and the second is negative, too bad i m stuck inside working. In this tweet, the second part is dominant, since the author can't enjoy this beautiful day. What about the tag and the predictions for this tweet?

All the models except SGD predict this tweet as neutral, and the tag is neutral as well. The SGD model and I agree that this tweet is negative.

A conclusion from the tweets I've examined is that tagging a tweet's sentiment is not black and white. I can think a tweet is positive while it sounds neutral to another person.

Since we are handling tweets, the writers are not writing experts, and their message may be read differently than they meant. Also, since we have a "neutral" class, the classification is harder.

Furthermore, it might be that the tagging work isn't good enough, as we saw in the EDA section with the selected text of neutral tweets. Models in the real world have this problem as well, and data often has low-quality labels.

I had an unsuccessful attempt to enhance the classification by combining the results from all the models. After predicting with the four models, I used the following logic: count a score for each tweet and sentiment combination; for each row, predict the sentiment with the highest score; if there is a tie, take the SGD prediction. I thought this would achieve better results, but I got around ~65% accuracy, in contrast to the theory of the determining majority.
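A sketch of that (unsuccessful) ensemble logic for a single tweet - the model names and tie-break are my reconstruction of the flow described above:

```python
from collections import Counter

def combine_predictions(preds):
    # preds: dict of model name -> predicted sentiment for one tweet.
    votes = Counter(preds.values())
    best, best_count = votes.most_common(1)[0]
    tied = [s for s, c in votes.items() if c == best_count]
    if len(tied) > 1:
        return preds["sgd"]  # tie-break with the SGD prediction
    return best

preds = {"multinomial_nb": "positive", "bernoulli_nb": "negative",
         "logreg": "negative", "sgd": "positive"}
print(combine_predictions(preds))
# → positive (2-2 tie, so the SGD vote wins)
```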

Tweet Sentiment - Phrases Extraction

After running classification algorithms, the challenge is to find what words in tweets support positive/negative/neutral sentiment.

As a result of our EDA analysis, I won't predict the neutral tweets. In the Kaggle competition, people who understood this bug used it in their favour, so we can expect our results to be lower than the Kaggle competitors'.

First try - Word Count Naive Approach

The idea - similar to the "Word Cloud" from the EDA section, I want to create a bucket of words for each sentiment: a statistical calculation, similar to tf-idf, intended to reflect how important a word is in a tweet, and score it by the word's sentiment.

When I have the bucket of words by sentiments, I can take the most "positive words" with the following flow -

1.   other_words <- get top neutral & negative words
2.   positives <- all_positive_words - other_words 
3.   top_positives <- positives[:1000]

top_positives maps each word to its number of occurrences. Of course, the same flow is done for negative words.

When I have this, for each tweet I create all word combinations (the power set). Based on the tweet sentiment, I can compute a score - for each combination, we score by each word's value (number of appearances). If a word doesn't appear in the sentiment's top-words dictionary, I reduce the score by a weight.

Finally - The combination with the best score is chosen.
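A toy sketch of this scoring flow (the penalty weight and word counts are illustrative, not the notebook's actual values):

```python
from itertools import combinations

def best_phrase(tweet, top_words, missing_penalty=1.0):
    # Enumerate all word subsets of the tweet, score each by the sentiment's
    # word counts, penalize words missing from the top-words dictionary,
    # and return the best-scoring combination.
    words = tweet.lower().split()
    best, best_score = words, float("-inf")
    for size in range(1, len(words) + 1):
        for combo in combinations(words, size):
            score = sum(top_words.get(w, -missing_penalty) for w in combo)
            if score > best_score:
                best, best_score = combo, score
    return " ".join(best)

positive_top_words = {"love": 120, "happy": 95}  # toy occurrence counts
print(best_phrase("i love my happy dog", positive_top_words))
# → love happy
```

Enumerating the full power set is exponential in tweet length, which is workable only because tweets are short.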

Comment - I thought about the tf-idf solution, and afterwards found this notebook https://www.kaggle.com/nkoprowicz/a-simple-solution-using-only-word-counts which has a similar approach. It helped me turn the "tf-idf" idea into code.

Testing the results - I'll look at the whole dataset and at only the positive & negative tweets.

Since I've used generate_selected_text on cleaned text, I am comparing against cleaned text.

We could use the "dirty" text instead, but our words dictionary would be less efficient.

Results

The overall score isn't reliable because of the neutral bug we found. The only reason I tested the score over all tweets is to know where I stand compared to the competition results, since neutral tweets are much easier to extract words from. The best score is 0.73615.

Kaggle's score is supposed to be similar to the jaccard function. jaccard_cleaned is for understanding how close we are if we look over letters and numbers only.

We can see that the score for the cleaned data is much better. It's hard to deal with the fact that our predictions include not only words but also signs and punctuation.

This is why this model isn't good enough - it can't handle the signs in the text, since it takes only the "most common" words, while the challenge requires predicting those signs as well, since they appear in the selected_text.

In tweets, words can be written differently - people use "leet" speak (replacing Latin letters with symbols), make typos, and exhibit many more Twitter quirks. So, we'll need a model that can predict with the symbols in the text.

Neural Networks - roBERTa Model

While reading about NLP, it was hard to miss the latest innovations in NLP: in 2018, Transformers, and in 2019, BERT - Bidirectional Encoder Representations from Transformers.

Google BERT is a pre-training method for natural language understanding that performs various NLP tasks better than ever before. BERT works in two steps. First, it uses a large amount of unlabeled data to learn a language representation in an unsupervised fashion, called pre-training. Then, the pre-trained model can be fine-tuned in a supervised fashion using a small amount of labeled training data to perform various supervised tasks.

RoBERTa (Robustly optimized BERT approach), introduced at Facebook, is a retraining of BERT with an improved training methodology. To improve the training procedure, RoBERTa removes the Next Sentence Prediction (NSP) task from BERT's pre-training and introduces dynamic masking, so that the masked token changes during the training epochs.

After reading and examining notebooks on Kaggle, I understood that the best results come from BERT-based models. I decided to go with the roBERTa model.

We can see that, unlike the word count model, there isn't much difference here between the cleaned and uncleaned scoring methods. The results: 47% Jaccard similarity between the selected text and the prediction, for both negative and positive tweets. This is an improvement over the word count approach, where we got around 25% similarity.

The model predicted pos & neg tweets with similar scores.

Model Predictions Analysis

I want to look at the worst predictions and try to understand what happened.

First, let's look at the negative tweets that have a Jaccard score of 0, and try to understand why the prediction wasn't right at all.

There are 233 such tweets.

From this example, which got a score of 0, it seems both options are reasonable. If I had to select the words that emphasise this tweet's negative sentiment, I would choose both words: RIP & Killed. But since the tagging was only "Killed", our model got a score of 0. Some predictions seem to deserve the score 0 more than this tweet. For example:

The common thing for those tweets is that the predicted word is positive, even though we are looking for a negative phrase. As for 1297, I am not so sure it's a negative tweet, but assuming it is, we need to extract the main negative words. This can be a result of training on both negative & positive tweets together. If many tweets (pos/neg) have this symptom - extracting negative words instead of positive ones or vice versa - we can consider running two separate models.

Next, let's look at the positive tweets that have a Jaccard score of 0, and try to understand why the prediction wasn't right at all.

There are 281 such tweets. Let's look at some examples.

What is common to all of these: our model wasn't so bad with its predictions, and for some tweets it was even better than the selected text, yet the Jaccard score is 0. Why? Looking at those tweets, you can see different chosen words than the tagged selected_text, and different symbol extraction (in those tweets, jaccard_cleaned is higher).

But our model also makes mistakes that seem less reasonable -

Those tweets' predictions are common English conjunctions, and not really positive phrases from the tweets. I am not sure of the reason, but it could be related to these words' popularity.

Next, even though jaccard_cleaned wasn't much higher than the Jaccard score overall, let's look at examples where it was -

We could gain slightly higher scores with those predictions, which got a lower jaccard than jaccard_cleaned.

Overall, the predictions seem good: even when the model got a low score, the prediction often expresses the sentiment keywords extracted from the text.

Conclusions and final words

Results

Classification - Accuracy Results

Classification is an easier problem than phrase extraction, and this is the reason we got higher scores. To get better results, I could use grid search methods to optimize the hyperparameters of all the models; they could potentially perform better with different hyperparameters.

Phrase Extraction

Word Count

roBERTa

General Conclusions

Self reflection

When starting this project, I thought NLP would be interesting to investigate. After a while, I understood this choice was better than I thought, given how many state-of-the-art advancements are happening in NLP at such a rapid pace. I was partly familiar with the Word2Vec concept and RNN networks, and started reading about them.

When reading, it was hard to miss the latest innovations in NLP: in 2018, Transformers, and in 2019, BERT - Bidirectional Encoder Representations from Transformers.

At first, it was hard to understand what is happening, and I saw too many scary words - "encoder/decoder", "attention", "self-attention", "masking". After watching this first lecture - https://www.youtube.com/watch?v=rv6ogBCIoC4 - it didn't seem so scary anymore, and I started modelling and having fun. Now, in my summer semester, I have a seminar, and my goal is to dive into these concepts more.

In my daily work, I work with researchers a lot, and I am happy I've been given the opportunity to become familiar with the research process and the data science project life cycle. Thank you!